Columbia Newsblaster: Multilingual News Summarization on the Web

Authors

  • David Kirk Evans
  • Judith L. Klavans
  • Kathleen McKeown
Abstract

We propose to show the new multilingual version of the Columbia Newsblaster news summarization system. The system addresses the problem of browsing news in multiple languages from multiple sites on the internet. It automatically collects, organizes, and summarizes news in multiple source languages, allowing the user to browse news topics with English summaries and to compare perspectives from different countries on those topics.

1 Multilingual Columbia Newsblaster

We propose to present a demo of the multilingual version of Columbia Newsblaster, which is currently in development. Columbia Newsblaster (http://newsblaster.cs.columbia.edu/) (McKeown et al., 2002) is a system for news browsing that crawls news from the web, clusters related articles, summarizes the multidocument clusters, and presents the clusters in a browsing interface separated into pre-defined categories. In this demo, participants will be able to

  • Browse summaries of current news from multiple languages
  • View news clusters with documents from a particular language
  • Compare summaries from documents in different languages

The multilingual version of Columbia Newsblaster is built upon the English version, taking advantage of the existing system by translating documents into English. The system has six major phases: crawling, article extraction, clustering, summarization, classification, and interface generation. In the demo, we will show the new translation phase and the changes made to encode all text in UTF-8. Figure 1 depicts the multilingual Columbia Newsblaster architecture, highlighting the changes made to support multiple languages.

Figure 1: Multilingual CU Newsblaster Architecture.

The multilingual version of Columbia Newsblaster crawls web sites in foreign languages and extracts article text from the HTML pages. Non-English documents are translated and clustered with English documents using the existing document clustering system.
For translation we use an interface to the Babelfish translation system, and an interface to a statistical Arabic machine translation system from IBM for Arabic documents. The resulting document clusters are then summarized, with different summaries created for documents based on country and language.

Figure 2: A screen shot comparing a summary from English documents to a summary from German documents.

Extracting article text

Our previous approach to extracting article text was hand-crafted for English sites; to support non-English sites we incorporated a new article extraction module that uses machine learning techniques to identify the article text. The new module parses HTML into blocks of text and computes a set of 34 simple text features for each text block. Training data is generated using a GUI, and Ripper (Cohen, 1996) is used to induce a hypothesis for categorizing text. The extractor currently works with sites in English, Russian, Japanese, Chinese, French, Spanish, German, Italian, Portuguese, Korean, and Arabic.

Over an English training set composed of 353 articles, the extractor had 89% recall and 90% precision. It had similar performance for Russian and Japanese training sets. Applying rules learned for one language to a different data set significantly degraded performance, which shows that the learned rules are specific to their training data and that the system adapts to different sites and languages by retraining. Adding new sites is easy: a human annotates web pages using the GUI, and a new categorization hypothesis is learned from the new training data.
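
The block-and-feature step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the set of block-delimiting tags, the four features, and the names `BlockSplitter`, `extract_blocks`, and `block_features` are all assumptions made for illustration (the abstract mentions 34 features but does not list them), and the rule-learning step with Ripper is left out.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "td", "tr", "table", "li", "br", "h1", "h2", "h3"}


class BlockSplitter(HTMLParser):
    """Splits an HTML page into text blocks at block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks: text plus linked-character count
        self._text = []         # text pieces of the block being built
        self._link_chars = 0    # characters seen inside <a> in the current block
        self._in_link = 0       # nesting depth of <a>
        self._skip = 0          # nesting depth of <script>/<style>

    def _flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append({"text": text, "link_chars": self._link_chars})
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def extract_blocks(html):
    """Return the list of text blocks found in an HTML string."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.blocks


def block_features(block):
    """Compute a few illustrative per-block features (hypothetical choices)."""
    text = block["text"]
    n_chars = len(text)  # never zero: empty blocks are not stored
    return {
        "num_chars": n_chars,
        "num_words": len(text.split()),
        "num_sentence_ends": sum(text.count(c) for c in ".!?"),
        "link_density": min(1.0, block["link_chars"] / n_chars),
    }
```

In a setup like this, each block's feature dictionary, paired with a human-assigned label such as "article text" or "other", would form one training example for a rule learner. Navigation bars tend to show a high link density and few sentence endings, which is the kind of regularity simple features let a learner pick up.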

Related resources

A Platform for Multilingual News Summarization

We have developed a multilingual version of Columbia Newsblaster as a testbed for multilingual multi-document summarization. The system collects, clusters, and summarizes news documents from sources all over the world daily. It crawls news sites in many different countries, written in different languages, extracts the news text from the HTML pages, uses a variety of methods to translate the doc...

Tracking and Summarizing News on a Daily Basis with Columbia’s Newsblaster

Recently, there have been significant advances in several areas of language technology, including clustering, text categorization, and summarization. However, efforts to combine technology from these areas in a practical system for information access have been limited. In this paper, we present Columbia’s Newsblaster system for online news summarization. Many of the tools developed at Columbia ...

Columbia's Newsblaster: New Features and Future Directions

Columbia’s Newsblaster tracking and summarization system is a robust system that clusters news into events, categorizes events into broad topics and summarizes multiple articles on each event. Here we outline our most current work on tracking events over days, producing summaries that update a user on new information about an event, outlining the perspectives of news coming from different count...

Do Summaries Help? A Task-Based Evaluation of Multi-Document Summarization

We describe a task-based evaluation to determine whether multi-document summaries measurably improve user performance when using online news browsing systems for directed research. We evaluated the multi-document summaries generated by Newsblaster, a robust news browsing system that clusters online news articles and summarizes multiple articles on each event. Four groups of subjects were asked ...

Columbia University at MSE 2005

We describe our participation in the Multilingual Summarization Evaluation 2005. We describe the Columbia summarizers that were used in our submission and discuss the evaluation, drawing conclusions about the performance of our summarizers, discussing the state of multilingual summarization in general and also listing issues that need consideration for future evaluations.

Publication date: 2004